where k denotes the full-precision kernels, w the reconstructed matrix, v the variance of y, μ the mean of the kernels, Ψ the covariance of the kernels, f_m the features of class m, and c the mean of f_m.

Zheng et al. [288] define a new quantization loss between the binary weights and the learned real values, and theoretically prove the necessity of minimizing the weight quantization loss. Ding et al. [56] propose a distribution loss to explicitly regularize the activation flow and develop a framework to formulate the loss systematically. Empirical results show that the proposed distribution loss is robust to the choice of training hyper-parameters.

In summary, these methods all aim to minimize the error and information loss caused by quantization, which improves the compactness and capacity of 1-bit CNNs.
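As a rough illustration of this family of objectives, the sketch below adds an auxiliary penalty on the gap between real-valued weights and their scaled binary projection to the task loss; the per-tensor scale and the weighting factor lam are illustrative assumptions, not the exact formulations of Zheng et al. or Ding et al.

```python
import torch

def quantization_loss(weights, lam=1e-4):
    """Illustrative quantization loss: penalize the distance between
    real-valued weights and their scaled binary projection.
    The per-tensor scale and `lam` are assumptions for illustration,
    not the exact loss of Zheng et al. [288] or Ding et al. [56]."""
    loss = 0.0
    for w in weights:
        alpha = w.abs().mean()          # per-tensor scaling factor
        w_bin = alpha * torch.sign(w)   # binary reconstruction of w
        loss = loss + (w - w_bin).pow(2).mean()
    return lam * loss

# usage (hypothetical): total_loss = task_loss + quantization_loss(model.parameters())
```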

1.1.6 Neural Architecture Search

Neural architecture search (NAS) has attracted significant attention thanks to its remarkable performance in various deep learning tasks. Impressive results have been reported for reinforcement-learning-based search, for example [306]. Recent methods such as differentiable architecture search (DARTS) [151] reduce the search time by formulating the task in a differentiable manner. To reduce redundancy in the network space, partially connected DARTS (PC-DARTS) was recently introduced to perform a more efficient search without compromising DARTS performance [265].
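For context, differentiable NAS methods such as DARTS relax the discrete choice among candidate operations on an edge into a softmax-weighted mixture. The sketch below shows this relaxation in simplified form; the candidate operation set and module structure are illustrative, not the released DARTS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted mixture over candidate operations on one edge,
    as in differentiable NAS (simplified; the candidate set is illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # architecture parameters, learned alongside the network weights
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

After the search, only the strongest operation on each edge is retained to form the final discrete architecture.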

In Binarized Neural Architecture Search (BNAS) [35], neural architecture search is used to search for BNNs, and the BNNs obtained by BNAS can outperform conventional models by a large margin. Another natural approach is to use 1-bit CNNs to reduce the computation and memory cost of NAS, taking advantage of the strengths of each in a unified framework [304]. To accomplish this, a Child-Parent (CP) model is introduced into differentiable NAS to search for the binarized architecture (Child) under the supervision of a full-precision model (Parent). In the search stage, the Child-Parent model uses an indicator generated from the accuracy of the Child and Parent models to evaluate the performance of candidate operations and abandon those with less potential. In the training stage, a kernel-level CP loss is introduced to optimize the binarized network. Extensive experiments demonstrate that the proposed CP-NAS achieves accuracy comparable to traditional NAS on both the CIFAR and ImageNet databases.
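As a loose sketch of the Child-Parent supervision idea (not the exact CP-NAS objective), the Child can be trained with a task loss plus terms pulling its outputs and its binarized kernels toward the full-precision Parent; the particular distance measures and the weight beta below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cp_loss(child_logits, parent_logits, child_kernels, parent_kernels,
            labels, beta=0.1):
    """Illustrative Child-Parent objective: task loss on the Child,
    a distillation term toward the Parent's outputs, and a kernel-level
    term pulling the binarized Child kernels toward the Parent's kernels.
    The specific terms and `beta` are assumptions, not the exact CP-NAS loss;
    Child and Parent kernels are assumed to share shapes."""
    task = F.cross_entropy(child_logits, labels)
    distill = F.kl_div(F.log_softmax(child_logits, dim=1),
                       F.softmax(parent_logits, dim=1),
                       reduction="batchmean")
    kernel = sum((c - p).pow(2).mean()
                 for c, p in zip(child_kernels, parent_kernels))
    return task + distill + beta * kernel
```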

Unlike conventional convolutions, BNAS is achieved by transforming all convolutions in the search space O into binarized convolutions. The full-precision and binarized kernels are denoted as X and X̂, respectively. A convolution operation in O is represented as B_j = B_i ⊗ X̂, where ⊗ denotes convolution. To build BNAS, a key step is to binarize the kernels from X to X̂, which can be implemented based on state-of-the-art BNNs such as XNOR-Net or PCNN. To further reduce the search cost, channel sampling and a reduction of the operation space are introduced into differentiable NAS, significantly cutting the number of GPU hours and leading to an efficient BNAS.
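A minimal sketch of the binarization step, assuming an XNOR-Net-style scaled sign projection for X̂ and a simple random channel-sampling helper to suggest how the search cost can be cut; the function names and sampling ratio are illustrative, not the exact BNAS implementation.

```python
import torch
import torch.nn.functional as F

def binarize_kernel(x):
    """XNOR-style binarization: sign of the kernel with a per-output-channel
    scaling factor (one common choice; BNAS may use other BNN schemes such as PCNN)."""
    alpha = x.abs().mean(dim=(1, 2, 3), keepdim=True)
    return alpha * torch.sign(x)

def binary_conv(feat, kernel, stride=1, padding=1):
    """Convolution with binarized kernels: B_j = B_i (conv) X_hat."""
    return F.conv2d(feat, binarize_kernel(kernel), stride=stride, padding=padding)

def sample_channels(feat, ratio=0.25):
    """Illustrative channel sampling: operate on a random subset of channels
    during the search to reduce memory and GPU hours."""
    c = feat.size(1)
    idx = torch.randperm(c)[: max(1, int(c * ratio))]
    return feat[:, idx], idx
```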

1.1.7 Optimization

Researchers have also explored new training methods to improve BNN performance. These methods are designed to address the drawbacks of BNNs. Some borrow popular techniques from other fields and integrate them into BNNs, while others modify classical BNN training, for example by improving the optimizer.

Sari et al. [234] find that the BatchNorm layer plays a significant role in avoiding exploding gradients, so the standard initialization methods developed for full-precision networks are irrelevant for BNNs. They also break down BatchNorm components into centering and